Corpora and data preparation
Abstract
The data selection and data preparation efforts which led to the TIPSTER and Fifth Message Understanding Conference (MUC-5) evaluation corpora involved substantial effort, time and resources. The Government commitment to these selection and preparation efforts stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extraction technology, (2) to provide accurate test data to evaluate and baseline system performance in an objective manner, (3) to provide a baseline for human performance to understand and interpret machine performance, and (4) to support the larger Natural Language Processing community by making available a unique set of texts and templates in multiple domains and languages under ARPA support. This commitment was demonstrated through the managerial, technical, and administrative support to these efforts from various Government agencies, as well as through the contractual efforts with the Institute for Defense Analyses for data preparation and New Mexico State University for software tool development.
Similar resources
Extracting parallel corpora from comparable documents to improve translation quality in machine translation systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
Corpora and Data Preparation for Information Extraction
The data selection and data preparation efforts which led to the TIPSTER and Fifth Message Understanding Conference (MUC-5) corpora involved substantial effort, time and resources. The Government commitment to these selection and preparation efforts stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extraction technology, (...
The MMSR bilingual and crosschannel corpora for speaker recognition research and evaluation
We describe efforts to create corpora to support and evaluate systems that meet the challenge of speaker recognition in the face of both channel and language variation. In addition to addressing ongoing evaluation of speaker recognition systems, these corpora are aimed at the bilingual and crosschannel dimensions. We report on specific data collection efforts at the Linguistic Data Consortium, ...
Experiments in Medical Translation Shared Task at WMT 2014
This paper describes Dublin City University’s (DCU) submission to the WMT 2014 Medical Summary task. We report our results on the test data set in the French to English translation direction. We also report statistics collected from the corpora used to train our translation system. We conducted our experiment on the Moses 1.0 phrase-based translation system framework. We performed a variety of ...
Design and Preparation of the 1996 Hub-4 Broadcast News Benchmark Test Corpora
This paper describes the procedures used in the preparation of the 1996 DARPA CSR Hub-4 Broadcast News Benchmark Test corpora and some analyses of that data. A new annotation/transcription process was designed and implemented to ensure that the transcripts were practically error-free and to negate the need to hold a post-test "adjudication" as in years past. This paper focuses on this new annot...